{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing of Book-Crossing Dataset\n",
    "\n",
    "The [Book-Crossing dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) contains data about book ratings, books and users collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import functions as f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'data/original/'\n",
    "\n",
    "df_ratings = pd.read_csv(path + 'BX-Book-Ratings.csv', sep=';', encoding='ansi')\n",
    "df_books = pd.read_csv(path + 'BX-Books.csv', sep=';', encoding='ansi', escapechar='\\\\')\n",
    "df_users = pd.read_csv(path + 'BX-Users.csv', sep=';', encoding='ansi')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "for df in [df_ratings, df_books, df_users]:\n",
    "    df.columns = [f.colname_fix(col) for col in df.columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ratings:\n",
      "Number of ratings: 1149780\n",
      "Number of books: 340556\n",
      "Number of users: 105283\n",
      "\n",
      "Number of books: 271379\n",
      "\n",
      "Number of users: 278858\n"
     ]
    }
   ],
   "source": [
    "print('Ratings:\\nNumber of ratings: %d\\nNumber of books: %d\\nNumber of users: %d' % (len(df_ratings),\n",
    "                                                                                     len(df_ratings['isbn'].unique()),\n",
    "                                                                                     len(df_ratings['user_id'].unique())))\n",
    "print('\\nNumber of books: %d' % len(df_books))\n",
    "print('\\nNumber of users: %d' % len(df_users))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "items with non-ascii characters in user_id: 0\n",
      "items with non-ascii characters in isbn: 55\n",
      "items with non-ascii characters in book_rating: 0\n",
      "\n",
      "items with non-ascii characters in isbn: 0\n",
      "items with non-ascii characters in book_title: 365\n",
      "items with non-ascii characters in book_author: 21\n",
      "items with non-ascii characters in year_of_publication: 0\n",
      "items with non-ascii characters in publisher: 33\n",
      "items with non-ascii characters in image_url_s: 0\n",
      "items with non-ascii characters in image_url_m: 0\n",
      "items with non-ascii characters in image_url_l: 0\n",
      "\n",
      "items with non-ascii characters in user_id: 0\n",
      "items with non-ascii characters in location: 560\n",
      "items with non-ascii characters in age: 0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "f.ascii_check_bulk(df_ratings)\n",
    "f.ascii_check_bulk(df_books)\n",
    "f.ascii_check_bulk(df_users)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering observations\n",
    "* Remove (incorrect) ISBN with non-ascii characters\n",
    "* Use only country instead of whole 'location' data\n",
    "* Remove images' urls\n",
    "* Separate explicit (1-10) and implicit (0) ratings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ratings['isbn_check'] = df_ratings['isbn'].apply(f.ascii_check)\n",
    "df_ratings = df_ratings[df_ratings['isbn_check']==0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_users['country'] = df_users['location'].apply(lambda x: x.split(', ')[-1].title())\n",
    "df_users['country_check'] = df_users['country'].apply(f.ascii_check)\n",
    "df_users.loc[df_users['country_check']==1, 'country'] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ratings.drop(['isbn_check'], axis=1, inplace=True)\n",
    "df_books.drop(['image_url_s', 'image_url_m', 'image_url_l'], axis=1, inplace=True)\n",
    "df_users.drop(['country_check'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Explicit ratings: 433642\n",
      "Implicit ratings: 716083\n"
     ]
    }
   ],
   "source": [
    "df_ratings_explicit = df_ratings[df_ratings['book_rating']!=0]\n",
    "df_ratings_implicit = df_ratings[df_ratings['book_rating']==0]\n",
    "\n",
    "print('Explicit ratings: %d\\nImplicit ratings: %d' % (len(df_ratings_explicit), len(df_ratings_implicit)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ratings_explicit.to_csv('data/ratings_explicit.csv', encoding='utf-8', index=False)\n",
    "df_ratings_implicit.to_csv('data/ratings_implicit.csv', encoding='utf-8', index=False)\n",
    "df_books.to_csv('data/books.csv', encoding='utf-8', index=False)\n",
    "df_users.to_csv('data/users.csv', encoding='utf-8', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reducing the dimensionality\n",
    "To reduce the dimensionality of the dataset and avoid running into memory error it will focus on users with at least 3 ratings and top 10% most frequently rated books. It consists of 176,594 records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Filter: users with at least 3 ratings\n",
      "Number of records: 368563\n"
     ]
    }
   ],
   "source": [
    "user_ratings_threshold = 3\n",
    "\n",
    "filter_users = df_ratings_explicit['user_id'].value_counts()\n",
    "filter_users_list = filter_users[filter_users >= user_ratings_threshold].index.to_list()\n",
    "\n",
    "df_ratings_top = df_ratings_explicit[df_ratings_explicit['user_id'].isin(filter_users_list)]\n",
    "\n",
    "print('Filter: users with at least %d ratings\\nNumber of records: %d' % (user_ratings_threshold, len(df_ratings_top)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Filter: top 10% most frequently rated books\n",
      "Number of records: 176594\n"
     ]
    }
   ],
   "source": [
    "book_ratings_threshold_perc = 0.1\n",
    "book_ratings_threshold = len(df_ratings_top['isbn'].unique()) * book_ratings_threshold_perc\n",
    "\n",
    "filter_books_list = df_ratings_top['isbn'].value_counts().head(int(book_ratings_threshold)).index.to_list()\n",
    "df_ratings_top = df_ratings_top[df_ratings_top['isbn'].isin(filter_books_list)]\n",
    "\n",
    "print('Filter: top %d%% most frequently rated books\\nNumber of records: %d' % (book_ratings_threshold_perc*100, len(df_ratings_top)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ratings_top.to_csv('data/ratings_top.csv', encoding='utf-8', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "master",
   "language": "python",
   "name": "master"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
